Project: Statistics Clothes and accessories: Sales and Product Details¶

Minh Anh Do¶

Examining Sales, Categories & Geography for Products¶

About Dataset¶

Delve into the intricate world of fashion commerce with our dataset, 'Wardrobe Economics: Examining Sales, Categories, and Geography for Products (Clothes and Accessories'). This comprehensive collection of data opens the door to a wealth of insights waiting to be discovered. Gain a profound understanding of consumer preferences and buying behavior as you immerse yourself in the intricate details of sales transactions, diverse product categories, and the geographical spread of fashion trends.

Unveil the hidden narratives behind each transaction, revealing the ebb and flow of consumer demands across a spectrum of stylish offerings. Delve into the intricacies of product categories, allowing you to identify burgeoning trends and enduring classics that define the ever-evolving world of fashion. Explore the geographical distribution of these trends, shedding light on regional nuances and preferences that shape the global fashion ecosystem.

With 'Wardrobe Economics,' you'll be empowered to conduct in-depth analyses that illuminate the path to success in the realm of clothing and accessories. Whether you're a trend-savvy retailer, a market researcher, or a curious enthusiast, this dataset beckons you to unravel its insights and unlock a deeper appreciation for the intricate interplay of sales, categories, and geography within the realm of fashion.

In [ ]:
!pip install fasteda
In [ ]:
!pip install datacleaner

Importing Libraries¶

In [3]:
#for eda
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

import numpy as np

from fasteda import fast_eda
from datacleaner import autoclean

import scipy
import scipy.stats as stats

Data Loading¶

In [4]:
df = pd.read_csv(r'C:\Users\manht\Documents\formation-python\Clothes and Accessories Sales and Product Details\Sales_Product_Details.csv')
df.head()
Out[4]:
Date Customer_ID Product_ID Quantity Unit_Price Sales_Revenue Product_Description Product_Category Product_Line Raw_Material Region Latitude Longitude
0 20210601 98 321 1 117.306016 117.306016 Cycling Jerseys Sports Tops Fabrics York 53.958332 -1.080278
1 20210602 92 261 4 32.272403 129.089613 Casual Shirts Menswear Tops Cotton Worcester 52.192001 -2.220000
2 20210603 92 264 1 36.193364 36.193364 Casual Shirts Menswear Tops Cotton Worcester 52.192001 -2.220000
3 20210604 99 251 3 29.913403 89.740210 Jeans Menswear Trousers Cotton Winchester 51.063202 -1.308000
4 20210605 66 251 1 41.843430 41.843430 Shorts Womenswear Trousers Cotton Winchester 51.063202 -1.308000

Data Inspection¶

In [5]:
df.shape
Out[5]:
(30, 13)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 30 non-null     int64  
 1   Customer_ID          30 non-null     int64  
 2   Product_ID           30 non-null     int64  
 3   Quantity             30 non-null     int64  
 4   Unit_Price           30 non-null     float64
 5   Sales_Revenue        30 non-null     float64
 6   Product_Description  30 non-null     object 
 7   Product_Category     30 non-null     object 
 8   Product_Line         30 non-null     object 
 9   Raw_Material         30 non-null     object 
 10  Region               30 non-null     object 
 11  Latitude             30 non-null     float64
 12  Longitude            30 non-null     float64
dtypes: float64(4), int64(4), object(5)
memory usage: 3.2+ KB
In [7]:
df.dtypes
Out[7]:
Date                     int64
Customer_ID              int64
Product_ID               int64
Quantity                 int64
Unit_Price             float64
Sales_Revenue          float64
Product_Description     object
Product_Category        object
Product_Line            object
Raw_Material            object
Region                  object
Latitude               float64
Longitude              float64
dtype: object
In [8]:
df.describe()
Out[8]:
Date Customer_ID Product_ID Quantity Unit_Price Sales_Revenue Latitude Longitude
count 3.000000e+01 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000
mean 2.021062e+07 57.400000 279.733333 2.066667 40.498759 79.687953 52.237571 -2.270437
std 8.803408e+00 31.457251 24.259990 1.142693 16.742578 43.008559 1.449567 1.367443
min 2.021060e+07 9.000000 251.000000 1.000000 21.965812 21.965812 50.259998 -5.051000
25% 2.021061e+07 32.500000 262.250000 1.000000 32.390679 36.774078 51.063202 -2.647000
50% 2.021062e+07 51.000000 276.000000 2.000000 36.191911 79.261696 52.192001 -1.490000
75% 2.021062e+07 90.750000 287.500000 3.000000 44.341442 113.761987 53.680000 -1.353500
max 2.021063e+07 99.000000 357.000000 4.000000 117.306016 175.486148 53.958332 -1.080278
In [9]:
df.isnull().sum()
Out[9]:
Date                   0
Customer_ID            0
Product_ID             0
Quantity               0
Unit_Price             0
Sales_Revenue          0
Product_Description    0
Product_Category       0
Product_Line           0
Raw_Material           0
Region                 0
Latitude               0
Longitude              0
dtype: int64
In [10]:
df.duplicated().sum()
Out[10]:
0

Data Cleaning¶

In [11]:
df = autoclean(df)
df.head()
Out[11]:
Date Customer_ID Product_ID Quantity Unit_Price Sales_Revenue Product_Description Product_Category Product_Line Raw_Material Region Latitude Longitude
0 20210601 98 321 1 117.306016 117.306016 3 2 2 2 5 53.958332 -1.080278
1 20210602 92 261 4 32.272403 129.089613 1 1 2 1 4 52.192001 -2.220000
2 20210603 92 264 1 36.193364 36.193364 1 1 2 1 4 52.192001 -2.220000
3 20210604 99 251 3 29.913403 89.740210 7 1 3 1 3 51.063202 -1.308000
4 20210605 66 251 1 41.843430 41.843430 12 3 3 1 3 51.063202 -1.308000

Exploratory Data Analysis¶

In [12]:
fast_eda(df)
DataFrame Head:
Date Customer_ID Product_ID Quantity Unit_Price Sales_Revenue Product_Description Product_Category Product_Line Raw_Material Region Latitude Longitude
0 20210601 98 321 1 117.306016 117.306016 3 2 2 2 5 53.958332 -1.080278
1 20210602 92 261 4 32.272403 129.089613 1 1 2 1 4 52.192001 -2.220000
2 20210603 92 264 1 36.193364 36.193364 1 1 2 1 4 52.192001 -2.220000
DataFrame Tail:
Date Customer_ID Product_ID Quantity Unit_Price Sales_Revenue Product_Description Product_Category Product_Line Raw_Material Region Latitude Longitude
27 20210628 56 262 1 33.470936 33.470936 5 1 2 5 5 53.958332 -1.080278
28 20210629 13 286 1 32.745507 32.745507 8 3 2 0 2 51.209000 -2.647000
29 20210630 91 291 1 31.879107 31.879107 8 1 2 0 2 51.209000 -2.647000
----------------------------------------------------------------------------------------------------
Missing values:
  0
----------------------------------------------------------------------------------------------------
Shape of DataFrame:

(30, 13)

----------------------------------------------------------------------------------------------------
DataFrame Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 30 non-null     int64  
 1   Customer_ID          30 non-null     int64  
 2   Product_ID           30 non-null     int64  
 3   Quantity             30 non-null     int64  
 4   Unit_Price           30 non-null     float64
 5   Sales_Revenue        30 non-null     float64
 6   Product_Description  30 non-null     int32  
 7   Product_Category     30 non-null     int32  
 8   Product_Line         30 non-null     int32  
 9   Raw_Material         30 non-null     int32  
 10  Region               30 non-null     int32  
 11  Latitude             30 non-null     float64
 12  Longitude            30 non-null     float64
dtypes: float64(4), int32(5), int64(4)
memory usage: 2.6 KB
----------------------------------------------------------------------------------------------------
Describe DataFrame:

  count mean median std min 25% 50% 75% max skewness kurtosis
Date 30 20210615.5 20210615.5 8.803 20210601 20210608.25 20210615.5 20210622.75 20210630 0.000000 -1.202670
Customer_ID 30 57.4 51 31.457 9 32.5 51 90.75 99 -0.045345 -1.382609
Product_ID 30 279.733 276 24.26 251 262.25 276 287.5 357 1.527557 2.302047
Quantity 30 2.067 2 1.143 1 1 2 3 4 0.432925 -1.324588
Unit_Price 30 40.499 36.192 16.743 21.966 32.391 36.192 44.341 117.306 3.305144 12.956859
Sales_Revenue 30 79.688 79.262 43.009 21.966 36.774 79.262 113.762 175.486 0.370341 -0.994856
Product_Description 30 8.133 8 4.531 0 5 8 11 17 0.063229 -0.719756
Product_Category 30 1.867 1.5 1.074 0 1 1.5 3 3 -0.071921 -1.578642
Product_Line 30 2.067 2 0.583 0 2 2 2 3 -1.057225 4.189920
Raw_Material 30 1.9 1 1.647 0 1 1 3 5 0.820911 -0.693793
Region 30 2.1 2 1.668 0 1 2 3 5 0.474169 -0.995923
Latitude 30 52.238 52.192 1.45 50.26 51.063 52.192 53.68 53.958 -0.036826 -1.677367
Longitude 30 -2.27 -1.49 1.367 -5.051 -2.647 -1.49 -1.353 -1.08 -1.277935 0.201208
----------------------------------------------------------------------------------------------------
DataFrame Correlation:

----------------------------------------------------------------------------------------------------
DataFrame Pairplot:

----------------------------------------------------------------------------------------------------
Histogram(s) & Boxplot(s):

----------------------------------------------------------------------------------------------------
Countplot(s):

Data Preprocessing¶

In [13]:
for column in df.columns : 
    plt.figure(figsize = (10,4))
    plt.subplot(121)
    sns.histplot(df[column])
    plt.title(column)
    
    plt.subplot(122)
    stats.probplot(df[column],dist = 'norm', plot = plt)
    plt.title(column)
    plt.show()
In [14]:
from sklearn.preprocessing import QuantileTransformer

qt = QuantileTransformer(output_distribution='normal')

for col in df.columns:
    df[col] = df[col] = qt.fit_transform(pd.DataFrame(df[col]))
In [15]:
for column in df.columns : 
    plt.figure(figsize = (10,4))
    plt.subplot(121)
    sns.histplot(df[column])
    plt.title(column)
    
    plt.subplot(122)
    stats.probplot(df[column],dist = 'norm', plot = plt)
    plt.title(column)
    plt.show()
In [16]:
for column in df.columns : 
    plt.figure(figsize = (10,4))
    plt.subplot(121)
    sns.histplot(df[column])
    plt.title(column)
    
    plt.subplot(122)
    stats.probplot(df[column],dist = 'norm', plot = plt)
    plt.title(column)
    plt.show()

Outliers¶

In [17]:
for column in df:
        plt.figure(figsize=(17,1))
        sns.boxplot(data = df, x = column)
In [18]:
for col in df:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    whisker_width = 1.5
    lower_whisker = q1 - (whisker_width * iqr)
    upper_whisker = q3 + whisker_width * iqr
    df[col] = np.where(df[col] > upper_whisker, upper_whisker, np.where(df[col] < lower_whisker, lower_whisker, df[col]))
In [19]:
for column in df:
        plt.figure(figsize=(17,1))
        sns.boxplot(data = df, x = column)
In [20]:
df.drop("Product_Line", axis = 1, inplace = True)
df.head()
Out[20]:
Date Customer_ID Product_ID Quantity Unit_Price Sales_Revenue Product_Description Product_Category Raw_Material Region Latitude Longitude
0 -2.700796 1.628361 1.483540 -5.199338 2.700796 0.817237 -1.089662 0.086543 0.399323 2.103162 2.625385 2.837372
1 -1.818646 0.879168 -0.817237 5.199338 -0.701873 1.089662 -1.628361 -0.595179 -0.307293 0.944670 0.043231 -0.307293
2 -1.483540 0.879168 -0.494873 -5.199338 0.043231 -0.701873 -1.628361 -0.595179 -0.307293 0.944670 0.043231 -0.307293
3 -1.262145 2.700796 -2.681732 0.647604 -0.944670 0.217798 -0.307293 -0.595179 -0.307293 0.544341 -0.758293 0.879168
4 -1.089662 0.262283 -2.681732 -5.199338 0.494873 -0.494873 0.879168 5.199338 -0.307293 0.544341 -0.758293 0.879168

Machine Learning¶

In [21]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from tabulate import tabulate
In [22]:
# Extract features and target
X = df.drop("Sales_Revenue", axis = 1)
Y = df["Sales_Revenue"]

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

Linear Regression¶

In [23]:
regLR = LinearRegression()
regLR.fit(X_train,Y_train)
yPredRegLR=regLR.predict(X_test)
mean_squared_error(yPredRegLR, Y_test)
Out[23]:
0.4044664482334006

Support Vector Regression¶

In [24]:
regSVR=SVR(kernel='linear')
regSVR.fit(X_train,Y_train)
yPredRegSVR=regSVR.predict(X_test)
mean_squared_error(yPredRegSVR, Y_test)
Out[24]:
0.3142226059467846

Decision Tree Regression¶

In [25]:
regDTR=DecisionTreeRegressor()
regDTR.fit(X_train,Y_train)
yPredRegDTR=regDTR.predict(X_test)
mean_squared_error(yPredRegDTR, Y_test)
Out[25]:
0.8018579611142632

Random Forest Regression¶

In [26]:
regRFR=RandomForestRegressor()
regRFR.fit(X_train,Y_train)
yPredRegRFR=regRFR.predict(X_test)
mean_squared_error(yPredRegRFR, Y_test)
Out[26]:
1.0856999572521733

Model Summary¶

In [27]:
data=[
     ["Linear Regression",round(mean_absolute_error(yPredRegLR, Y_test),2),round(np.sqrt(mean_squared_error(yPredRegLR, Y_test)),2),round(r2_score(yPredRegLR, Y_test),2)],
     ["Support Vector Regression",round(mean_absolute_error(yPredRegSVR, Y_test),2),round(np.sqrt(mean_squared_error(yPredRegSVR, Y_test)),2),round(r2_score(yPredRegSVR, Y_test),2)],
     ["Decision Tree Regression",round(mean_absolute_error(yPredRegDTR, Y_test),2),round(np.sqrt(mean_squared_error(yPredRegDTR, Y_test)),2),round(r2_score(yPredRegDTR, Y_test),2)],
     ["Random Forest Regression",round(mean_absolute_error(yPredRegRFR, Y_test),2),round(np.sqrt(mean_squared_error(yPredRegRFR, Y_test)),2),round(r2_score(yPredRegRFR, Y_test),2)] 
    ]
columns=["Model Name","Mean Absolute Error","Root Mean Squared Error","R Squared Error"]

print(tabulate(data, headers=columns, tablefmt="fancy_grid"))
╒═══════════════════════════╤═══════════════════════╤═══════════════════════════╤═══════════════════╕
│ Model Name                │   Mean Absolute Error │   Root Mean Squared Error │   R Squared Error │
╞═══════════════════════════╪═══════════════════════╪═══════════════════════════╪═══════════════════╡
│ Linear Regression         │                  0.52 │                      0.64 │              0.49 │
├───────────────────────────┼───────────────────────┼───────────────────────────┼───────────────────┤
│ Support Vector Regression │                  0.47 │                      0.56 │              0.59 │
├───────────────────────────┼───────────────────────┼───────────────────────────┼───────────────────┤
│ Decision Tree Regression  │                  0.8  │                      0.9  │             -1.41 │
├───────────────────────────┼───────────────────────┼───────────────────────────┼───────────────────┤
│ Random Forest Regression  │                  0.91 │                      1.04 │             -0.71 │
╘═══════════════════════════╧═══════════════════════╧═══════════════════════════╧═══════════════════╛
In [28]:
# Create a DataFrame with actual and predicted values
results_df = pd.DataFrame({
    'Actual Values': Y_test.values,
    'Predicted Values': yPredRegRFR})

results_df.head(10)
Out[28]:
Actual Values Predicted Values
0 -0.043231 0.801596
1 0.043231 0.593894
2 0.817237 -0.754413
3 1.483540 0.781099
4 -0.399323 -0.614952
5 -0.130019 1.447078
In [29]:
# Plotting the linear regression line
plt.scatter(Y_test, yPredRegRFR, alpha=0.7, label='Predicted')
plt.plot([min(Y_test), max(Y_test)], [min(Y_test), max(Y_test)], linestyle='--', color='red', label='Perfect Prediction')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values')
plt.legend()
plt.show()